
This notebook attempts to generate monophonic music in the style of Johann Sebastian Bach, using long short-term memory (LSTM) recurrent neural networks.
The input of this network will be a single MIDI file, and the output will be a new MIDI file, generated by the network, in a style similar to the input piece.
Below is a brief overview of this process:

Once the MIDI is generated, it is converted to a wav file via wildmidi.
In this notebook music21, an open-source music library developed at MIT, will be used to read and create MIDI files. Keras will be used to process the MIDI data and generate new pieces of music. Additionally, matplotlib will be used to plot various graphs.
Below these elements are imported:
import os
from music21 import converter, corpus, instrument, midi, note, chord, pitch, stream
import numpy as np
import keras
from keras import models
from keras import layers
from keras.utils.np_utils import to_categorical
from matplotlib import pyplot
import IPython.display as ipd
Firstly, a MIDI file is read and its contents converted to a music21 "stream" object. This turns the binary data of the MIDI file into an easily readable format. Due to a lack of computational resources, only one piece at a time will be used in this notebook.
For this notebook, Bach's Contrapunctus I from The Art of Fugue will be used:
def readMidi(filename):
    mf = midi.MidiFile()
    mf.open(filename)
    mf.read()
    mf.close()
    return midi.translate.midiFileToStream(mf)

midiTrack = readMidi("bach.mid")
The extracted MIDI data is a stream of note data containing multiple "parts". A part represents a single monophonic line of notes. Below, the frequency of each note in each part is extracted and plotted as a visualisation. We begin by extracting the frequency data from the piece:
def createFreqArray(track):
    returnArray = []
    for i in range(0, len(track.parts)):
        tmpArray = []
        section = track.parts[i].flat.notes
        for thisNote in section:
            if isinstance(thisNote, note.Note):
                # Store the frequency as a float so matplotlib treats the
                # y-axis as numeric rather than categorical
                freq = float(thisNote.pitch.frequency)
                tmpArray.append(freq)
        returnArray.append(tmpArray)
    return returnArray

freqPattern = createFreqArray(midiTrack)
These frequencies are then plotted on a graph:
def visualiseFreqArray(array):
    for i in range(len(array)):
        x = range(len(array[i]))
        y = array[i]
        pyplot.scatter(x, y, alpha=0.5)
    pyplot.title("Visualisation of piece")
    pyplot.ylabel("Frequency (Hz)")
    pyplot.xlabel("Step")
    pyplot.show()

visualiseFreqArray(freqPattern)
Sequences of note names and the duration of each note are extracted into arrays in order to further examine these parameters. To obtain every note in the piece, a function is needed that loops through each part and then through each note of that part. Since we are only going to be generating monophonic music, each part's notes are concatenated into one long array, ready to be processed.
Two functions are created, one for extracting each feature (note and duration) from the MIDI stream:
# For extracting notes
def createNoteArray(track):
    returnArray = []
    for i in range(0, len(track.parts)):
        section = track.parts[i].flat.notes
        for thisNote in section:
            if isinstance(thisNote, note.Note):
                pitchName = str(thisNote.pitch)
                returnArray.append(pitchName)
    return returnArray

# For extracting duration data
def createDurationArray(track):
    returnArray = []
    for i in range(len(track.parts)):
        section = track.parts[i].flat.notes
        for thisNote in section:
            if isinstance(thisNote, note.Note):
                dur = str(thisNote.duration.quarterLength)
                returnArray.append(dur)
    return returnArray

notePattern = createNoteArray(midiTrack)
durationPattern = createDurationArray(midiTrack)
In order to feed the data into the network it needs to be encoded. Each unique note name and duration is assigned an integer code (the training labels will later be one-hot encoded with to_categorical):
noteDict = {}
durationDict = {}
encodedNote = []
encodedDuration = []

def encodeNote(data, codeDict):
    # Assign the next free integer to unseen values; reuse existing codes otherwise
    if data not in codeDict:
        codeDict[data] = len(codeDict)
    return codeDict[data]

# Encode note data
for rawNote in notePattern:
    encodedNote.append(encodeNote(rawNote, noteDict))

# Encode duration data
for rawDur in durationPattern:
    encodedDuration.append(encodeNote(rawDur, durationDict))
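As a quick sanity check, the encoding scheme can be exercised on a toy sequence. The note names and the encodeValue helper below are invented purely for illustration; they mirror the dictionary-based encoding used above:

```python
def encodeValue(data, codeDict):
    # Same scheme as above: unseen values get the next free integer code
    if data not in codeDict:
        codeDict[data] = len(codeDict)
    return codeDict[data]

toyDict = {}
toyNotes = ["C4", "E4", "G4", "C4", "E4"]
encoded = [encodeValue(n, toyDict) for n in toyNotes]
print(encoded)   # [0, 1, 2, 0, 1]
print(toyDict)   # {'C4': 0, 'E4': 1, 'G4': 2}
```

Repeated values map back to their original code, so the dictionary can later be inverted to decode predictions.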
In order to train the network a dataset must be created from the encoded data. This dataset will consist of two parts: The training data and the training labels.
To create the training data, patterns of length $n$ are taken from the encoded data with a stride of one. The label is simply the note or duration which directly follows the pattern. This can be seen below:
def createData(encodedData, dictLen):
    sequenceLen = dictLen
    x = []
    y = []
    for i in range(0, len(encodedData) - sequenceLen):
        thisX = encodedData[i:i + sequenceLen]
        x.append(thisX)
        thisY = encodedData[i + sequenceLen]
        y.append(thisY)
    x = np.reshape(x, (len(x), sequenceLen, 1))
    y = to_categorical(y)
    return x, y
xNote, yNote = createData(encodedNote, len(noteDict))
xDuration, yDuration = createData(encodedDuration, len(durationDict))
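The shapes this sliding window produces can be checked on a short made-up sequence. The example below uses a hypothetical 8-symbol sequence and a window of 3, with a NumPy identity matrix standing in for Keras' to_categorical:

```python
import numpy as np

# Hypothetical encoded sequence over a 3-symbol vocabulary
encoded = [0, 1, 2, 0, 1, 2, 0, 1]
sequenceLen = 3

x, y = [], []
for i in range(len(encoded) - sequenceLen):
    x.append(encoded[i:i + sequenceLen])   # input window
    y.append(encoded[i + sequenceLen])     # the symbol that follows it

x = np.reshape(x, (len(x), sequenceLen, 1))
y = np.eye(3)[y]  # one-hot labels, standing in for to_categorical

print(x.shape)  # (5, 3, 1): 5 windows of length 3, one feature per step
print(y.shape)  # (5, 3): one one-hot label per window
```

An 8-element sequence with window length 3 yields 8 - 3 = 5 training pairs, which matches the loop bound in createData.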
Two models need to be created. Multiclass classification will be performed on two sets of data: note values and note durations. As both are sequences which rely heavily on previous values, an LSTM layer is used to first process the data. Its output is then fed into a dense layer which predicts the category the input pattern should correspond to. Since this is multiclass classification, the categorical crossentropy loss function shall be used:
# Model for notes
noteModel = keras.models.Sequential()
noteModel.add(layers.LSTM(128, input_shape=(xNote.shape[1], xNote.shape[2])))
noteModel.add(layers.Dense(len(yNote[0]), activation='softmax'))
noteModel.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Model for note duration
durationModel = keras.models.Sequential()
durationModel.add(layers.LSTM(128, input_shape=(xDuration.shape[1], xDuration.shape[2])))
durationModel.add(layers.Dense(len(yDuration[0]), activation='softmax'))
durationModel.compile(loss='categorical_crossentropy', optimizer='rmsprop')
The two models need to be trained. They are both trained for 250 epochs and their history is recorded.
print("==Training note model==")
noteModelHistory = noteModel.fit(xNote, yNote, epochs=250)
print("==Training duration model==")
durationModelHistory = durationModel.fit(xDuration, yDuration, epochs=250)
The loss of each model is plotted in order to see how the training has progressed:
def plotLoss(history, name):
    pyplot.plot(history.history['loss'])
    pyplot.title(name + " loss per epoch")
    pyplot.ylabel('loss')
    pyplot.xlabel('epoch')
    # Only training loss is recorded, as no validation data was supplied to fit()
    pyplot.legend(['train'], loc='upper right')
    pyplot.show()

plotLoss(noteModelHistory, "Note model")
plotLoss(durationModelHistory, "Duration model")
The network has now reached a point where predictions can be made and, in turn, new music generated.
def predictFeature(feature, model, length):
    # Take a random sequence from the training data as a seed
    tempPrediction = feature[np.random.randint(0, len(feature) - 1)]
    generated = []
    for i in range(length):
        # Make a prediction
        prediction_input = np.reshape(tempPrediction, (1, len(tempPrediction), 1))
        guess = model.predict(prediction_input, verbose=0)
        result = np.argmax(guess)
        # Append the newly predicted feature and drop the oldest element,
        # sliding the window forward by one
        tempPrediction = np.append(tempPrediction, result)
        tempPrediction = tempPrediction[1:]
        # Append prediction to return array
        generated.append(result)
    return generated
notePredictionsRaw = predictFeature(xNote, noteModel, 200)
durationPredictionsRaw = predictFeature(xDuration, durationModel, 200)
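The mechanics of this autoregressive loop can be sketched in isolation with a dummy stand-in for the trained model. DummyModel and the symbol values below are invented for illustration; the loop body mirrors predictFeature above:

```python
import numpy as np

class DummyModel:
    """Toy predictor: 'predicts' the first symbol of the window plus one, mod 4."""
    def predict(self, x, verbose=0):
        nextSym = (int(x[0, 0, 0]) + 1) % 4
        probs = np.zeros((1, 4))
        probs[0, nextSym] = 1.0
        return probs

def generateSequence(seed, model, length):
    window = list(seed)
    generated = []
    for _ in range(length):
        prediction_input = np.reshape(window, (1, len(window), 1))
        result = int(np.argmax(model.predict(prediction_input, verbose=0)))
        # Slide the window: append the prediction, drop the oldest symbol
        window = window[1:] + [result]
        generated.append(result)
    return generated

print(generateSequence([0, 1, 2], DummyModel(), 5))  # [1, 2, 3, 2, 3]
```

Each prediction is fed back in as input, so the output drifts away from the seed after the first few steps, exactly as in the real generation loop.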
In order to retrieve the names of the notes predicted by the network, the integer encoding must be reversed. Below, two functions are created: one which inverts an encoding dictionary, and one which loops through each prediction, reversing the encoding:
def reverseDict(thisDict):
    return {v: k for k, v in thisDict.items()}

def decodePredictions(predictions, thisDict):
    decodedPredictions = []
    invDict = reverseDict(thisDict)
    for thisPrediction in predictions:
        decodedPredictions.append(invDict[thisPrediction])
    return decodedPredictions

notePredictions = decodePredictions(notePredictionsRaw, noteDict)
durationPredictions = decodePredictions(durationPredictionsRaw, durationDict)
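The round trip can be verified on a small made-up dictionary (the note names are illustrative, not taken from the piece):

```python
# Hypothetical encoding dictionary, as built during the encoding step
toyDict = {"C4": 0, "E4": 1, "G4": 2}

def reverseDict(thisDict):
    return {v: k for k, v in thisDict.items()}

def decodePredictions(predictions, thisDict):
    invDict = reverseDict(thisDict)
    return [invDict[p] for p in predictions]

print(decodePredictions([2, 0, 1], toyDict))  # ['G4', 'C4', 'E4']
```

Because each note was assigned a unique integer, inverting the dictionary loses no information.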
Now the generated notes and durations can be printed out in order to ensure no strange results have been created:
print(notePredictions)
print(durationPredictions)
The next step is to convert the two arrays into a music21 score in order to eventually convert them into MIDI. Below a function takes in the generated notes and durations and outputs a score. An organ voice is chosen for the piece:
def generateScore(genNotes, genDurs):
    generatedScore = stream.Score()
    mainPart = stream.Part()
    for i in range(len(genNotes)):
        thisNote = note.Note(genNotes[i])
        thisNote.duration.quarterLength = float(genDurs[i])
        mainPart.append(thisNote)
    mainPart.insert(0, instrument.Organ())
    generatedScore.insert(0, mainPart)
    return generatedScore

generatedScore = generateScore(notePredictions, durationPredictions)
Before writing the score to file, it is first visualised to ensure conversion occurred correctly:
generatedFreqArray = createFreqArray(generatedScore)
visualiseFreqArray(generatedFreqArray)
And finally the score is written to the file generated.mid:
generatedScore.write("midi", "generated.mid")
In order to play the MIDI file it needs to be converted to wav, then finally mp3. Ways of doing this vary depending on platform. On Linux, the following command was used:
wildmidi generated.mid --wavout=generated.wav
then
yes | ffmpeg -i generated.wav generated.mp3
(with yes preventing ffmpeg from waiting for input at the terminal)
and finally removing the wav file
rm generated.wav
def midiToWav(filename):
    cmd = "wildmidi " + filename + ".mid --wavout=" + filename + ".wav"
    print(os.popen(cmd).read())
    cmd = "yes | ffmpeg -i " + filename + ".wav " + filename + ".mp3"
    print(os.popen(cmd).read())
    cmd = "rm " + filename + ".wav"
    print(os.popen(cmd).read())

midiToWav("generated")
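As an aside, os.popen silently swallows exit codes, so a failed conversion goes unnoticed. A hedged alternative sketch using subprocess (assuming wildmidi and ffmpeg are on the PATH; ffmpeg's -y flag replaces the yes pipe by overwriting output without prompting):

```python
import subprocess

def buildConversionCommands(filename):
    # ffmpeg's -y flag replaces `yes |`: overwrite existing output without asking
    return [
        ["wildmidi", filename + ".mid", "--wavout=" + filename + ".wav"],
        ["ffmpeg", "-y", "-i", filename + ".wav", filename + ".mp3"],
        ["rm", filename + ".wav"],
    ]

def midiToMp3(filename):
    for cmd in buildConversionCommands(filename):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Passing the arguments as a list also avoids shell quoting issues with unusual filenames.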
(If the piece won't play, rerun the Python code. Occasionally the player will display 9:17:59; this is just a glitch and should be ignored. Alternatively, the wav files are available in the notebook directory.)
Below is the final piece converted into wav format:
ipd.Audio('generated.mp3')
Additional attempts were made prior to the final run of the notebook, an early attempt can be found below:
ipd.Audio('generated-other-1.mp3')
ipd.Audio('generated-other-2.mp3')
Three pieces of music were generated via an LSTM network trained on integer-encoded data from Bach's Contrapunctus I from The Art of Fugue. These three pieces generally stay in key and are Bach-esque, with some interesting original note and duration sequences.
However, whilst in the style of Bach, these pieces are often directionless and repetitive. Additionally, on several occasions certain note sequences are verbatim copies of the original score.
One issue holding the project back was the lack of computational power. It took approximately 20 minutes to train the network on a single MIDI file; this resulted in my laptop overheating a couple of times. Ideally I would like to try training the network on anywhere between 5 and 10 pieces of music, which could potentially enable it to generate more diverse and interesting pieces.
Much more work and computational power are required in the future to improve the quality and coherence of the generated music. However, overall, I see this project as a success, as the network did manage to generate music resembling the style of Bach.